
Conversation


ggerganov (Member) commented on Jun 2, 2025

ref #13963

To properly support this, we first need to fix this TODO:

llama_memory_state_ptr llama_kv_cache_unified_iswa::init_batch(const llama_batch & batch, uint32_t n_ubatch, bool embd_pooled, bool logits_all) {
    GGML_UNUSED(embd_pooled);

    // TODO: if we fail with split_simple, we should attempt different splitting strategies
    //       but to do that properly, we first have to refactor the batches to be more flexible
    auto sbatch = llama_sbatch(batch, hparams.n_embd, true, logits_all);

    std::vector<llama_ubatch> ubatches;
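
For reference, a rough sketch of the fallback the TODO describes: build the ubatches with the simple split first and, if they cannot be placed in the cache, rebuild them with an alternative strategy such as split_equal. This is only an illustration of the idea, assuming the same function context as the excerpt above; the fits_in_cache() helper is hypothetical, and the real check happens deeper inside init_batch.

    // hypothetical sketch (not the actual llama.cpp implementation):
    // try the simple split first, then fall back to an equal split
    auto make_ubatches = [&](bool simple_split) {
        auto sbatch = llama_sbatch(batch, hparams.n_embd, simple_split, logits_all);

        std::vector<llama_ubatch> ubatches;
        while (sbatch.n_tokens > 0) {
            ubatches.push_back(simple_split ? sbatch.split_simple(n_ubatch)
                                            : sbatch.split_equal(n_ubatch));
        }

        return ubatches;
    };

    auto ubatches = make_ubatches(/*simple_split =*/ true);

    if (!fits_in_cache(ubatches)) {
        // fits_in_cache() is a placeholder for whatever check decides that the
        // simple split cannot be allocated in the unified KV cache
        ubatches = make_ubatches(/*simple_split =*/ false);
    }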

ggerganov changed the title from "server : use swa-full fo draft context" to "server : disable speculative decoding for SWA models" on Jun 2, 2025
ggerganov marked this pull request as ready for review on Jun 2, 2025, 18:07
ggerganov requested a review from ngxson as a code owner on Jun 2, 2025, 18:07
ggerganov merged commit 3637576 into master on Jun 2, 2025 (46 checks passed)
ggerganov deleted the gg/server-spec-swa-clear branch on Jun 2, 2025, 18:34
furyhawk pushed a commit to furyhawk/llama.cpp that referenced this pull request on Jun 6, 2025
* server : use swa-full fo draft context

ggml-ci

* server : disable speculative decoding for SWA models